Abstract

The tremendous growth of search engines, dictionary
and thesaurus storage, and other text mining applications,
combined with the popularity of readily available
scanning devices and optical character recognition tools, 
has necessitated efficient storage, retrieval and man- 
agement of massive text databases for various modern 
applications. For such applications, we propose a novel 
data structure, INSTRUCT, for efficient storage and 
management of sequence databases. Our structure uses 
bit vectors for reusing the storage space for common 
triplets, and hence, has a very low memory requirement. 
INSTRUCT efficiently handles prefix and suffix search 
queries in addition to the exact string search operation 
by iteratively checking the presence of triplets. We 
also propose an extension of the structure to handle 
substring search efficiently, albeit with an increase in 
the space requirements. This extension is important 
in the context of trie-based solutions which are unable 
to handle such queries efficiently. Our experiments
show that INSTRUCT outperforms the
existing structures by nearly a factor of two in terms of
space requirements, while the query times are also better.
The ability to handle insertion and deletion of strings in 
addition to supporting all kinds of queries including exact 
search, prefix/suffix search and substring search makes 
INSTRUCT a complete data structure. 

Keywords: String Indexing, Prefix, Suffix, Substring. 

1 Introduction 

Efficient manipulation of large sets of strings has emerged
as a basic requirement for a growing number of applications
including search engines [34], port cataloging on the
web [26], dictionary and thesaurus support [2, 27], news
archive, document repository, mining XML databases [9, 
24], searching reserved words in a compiler [1], automa- 
ton searching [5], text compression [7], and indexing huge 
databases. To enhance the performance of retrieval and 
update queries, mechanisms reducing the storage space 
requirement, making them in-memory if possible, are crit- 
ical. With the tremendous improvement in scanning and 
optical character recognition technologies along with the 
efforts in internationalization and localization, the amount 
of textual data is beginning to explode. Storing such a vast 
amount of data itself poses a big problem. The further 
requirement of in-memory index structures for fast look- 
ups [35] calls for a compressed representation of even the 
index structure. 

Tries [17] and similar constructs try to achieve this by 
storing each character as a node in a tree and reusing 
some of the prefix nodes. Since each string is represented 
as a path from the root to a leaf, the memory require- 
ment is large [13, 32], thereby limiting their application 
for large text databases. Compact tries [32] and suffix 
trees [23, 29] aim to alleviate this problem by reusing 
the storage space of the common prefix or suffix of the 
strings. However, once two strings differ in a single char- 
acter, their paths differ, and they are stored separately even 
though the rest may be the same. In other words, these 
structures do not aim to reuse the characters forming the 
strings. As all strings are composed of a defined set of 
characters, reusing the storage space for common char- 
acters promises to provide the most compressed form of 
representation. This redundancy linked with the need for 
extreme space-efficient index structures motivated us to 
develop INSTRUCT (INdexing STrings by Re-Using Common
Triplets).

With the size of databases breaking the barrier of terabytes,
efficient data mining operations call for fast techniques
for tackling prefix, suffix and substring searches.
Prefix and suffix search queries allow context-based data 
retrieval. Data compression techniques, as in the sorting 
stage of Burrows-Wheeler transform [10] also utilize such 
searches. Even data clustering algorithms, like suffix tree 
clustering used in search engines make use of efficient 
suffix searching. Pattern or substring search is an impor- 
tant query operation in large genome and text data storage, 
and is used in software maintenance [6] and text editing 
among other related fields. 

We show that INSTRUCT efficiently handles such 
search queries, thereby making it a complete indexing 
structure. While the experiments show that INSTRUCT 
does not achieve industrial-scale (orders of magnitude) 
speed-ups over the competing structures, we feel that the 
ability of INSTRUCT to handle all string operations at a 
better or equal cost makes it a comprehensive structure for 
string databases. 

In a nutshell, our contributions are as follows: 

1. We have designed an intelligent structure IN- 
STRUCT for sequence indexing that reuses the stor- 
age space for common characters. 

2. We have depicted how different operations such as 
insertion and searching, including prefix, suffix and 
substring searching, can be efficiently supported by 
our structure. 

3. We have shown that INSTRUCT outperforms the ex- 
isting structures by up to a factor of two in memory 
requirements while maintaining better or comparable 
running times for searching and insertion. 

The paper is organized as follows. Section 2 provides a 
glimpse of the existing data structures for string manage- 
ment. Section 3 defines the structure of INSTRUCT. Al- 
gorithms for insertion, searching, etc. using INSTRUCT 
are presented and analyzed in Section 4. Section 5 reports 
the experimental results before Section 6 concludes. 

2 Related Work 

Although hashing [15, 28] provides the fastest way of indexing
keys, the fact that the size of the hash table depends
heavily on the data collision rate, coupled with no reuse of 
common character storage, often compels disk accesses, 
thereby limiting its efficiency. Moreover, it does not sup- 
port efficient prefix, suffix or substring search operations. 
Tries [2, 17] are tree-like structures that reuse the storage 
space for common prefixes, by storing each subsequent 
character separately as a node. Compact tries [22, 32] fold
the tree path leading up to a single leaf node, i.e., a single
suffix, into a single node. The suffix tree [23, 29, 33]
and prefix tree [17, 21] respectively collapse the common 
suffix or prefix into single nodes, but with the increase in 
the number of unique keys stored, the length of such common
suffixes and prefixes decreases, whereby the structures
degenerate. Patricia tries [25] extend the concept of
folding used by compact tries to single-branch nodes even
within the tree structure to increase space efficiency, but
use optimizations to restrict false positive query results.
Ternary search trees (TSTs) [8, 12] are 3-way tree structures
with each branching node replaced by a binary search tree.
This optimization makes TSTs require less space than
the standard tries [11], but also makes them much slower.
VLC-tries [18] and LZ-tries [30] do reduce the storage 
space required, but have significantly complex structures 
and procedures for querying, which are difficult to imple- 
ment. VLC-trie uses the divide-and-conquer method to 
obtain a partition of the edges of the trie into levels that are 
compressed. Dictionary compression methods like RLE, 
front-compression, and the LZ family [31] represent data 
in compressed form, and use Patricia tries, prefix trees, 
and LZ tries respectively. However, these methods have
highly involved insertion procedures, and dynamic operations
are not well supported. For example, the basic trie
structure does not support efficient substring searching, 
while prefix and suffix trees are biased towards only a 
subset of the family of search procedures. Several other 
similar structures such as the suffix array cater to this end. 
However, INSTRUCT inherently allows efficient search 
procedures for all the above methods with lower memory 
requirements. Burst trie [19] stores keys in buckets in- 
dexed by trie-like paths and dynamically splits (or bursts) 
the buckets during insertion. Although it is currently the 
most space-efficient structure [3], its performance varies 
widely with the heuristic for the choice of parameters gov- 
erning the bursting of the overflowing nodes. B-tries [4] 
provide a disk version of burst tries. 

The common space inefficiency of all these structures
arises from the lack of reuse of storage for the individual
characters forming the keys. INSTRUCT utilizes just a 
single node for each triplet of characters, and maps each 
triplet of a key into the corresponding node. It, thus, 
forms an efficient in-memory data structure. The keys 
are stored based on the 3-grams [16] present, with a unit 
window shift to obtain the next trigram. Indexing with 
INSTRUCT is therefore closely related to that using n- 
gram indexing [20]. In INSTRUCT, a set bit represents 
all strings containing the triplet, and there is no need to 
merge the results as in the case of n-gram indexing. This 
makes INSTRUCT simpler and faster. Further, the optimizations
achieved by reusing the space, and bit vectors
that allow efficient pruning, along with the robust range of
operations supported, make INSTRUCT more attractive
than simple n-gram indexing.

3 Structure of INSTRUCT 

We assume that keys (or equivalently, strings or words)
that need to be indexed are sequences of characters from
an alphabet of size k. We also assume that the length of
any key is at most l. For example, in an English
dictionary, k = 26 and l = 29 (footnote 1). If there are m keys, the
total number of characters in the database is d ≤ ml.

The INSTRUCT structure comprises a collection of k 
nodes, each corresponding to a particular character of the 
alphabet. Each node in turn comprises a k x k matrix. A 
cell in the matrix corresponds to a particular sequence of 3 
characters. We refer to this 3-character set as a triplet or a 
3-gram. The cell in the node $c_1$ at row $c_2$ and at column $c_3$
represents the triplet $c_1 c_2 c_3$, where $c_i$ denotes a character
from the alphabet. When a particular triplet is present in 
a key in the database, the corresponding cell is marked. 

However, a triplet may occur at different offsets in a 
key. It is thus beneficial to include this position information
in the index. To enable indexing of positions, a cell
is further broken up into an array of l elements, corresponding
to the l positions where a triplet can occur in a key (footnote 2).
When a triplet occurs, only the corresponding element is
marked. This, we call the position array.

Although INSTRUCT can naturally adapt to dynam- 
ically increasing string lengths, fixing the length ini- 
tially makes the representation simple as then all the 
structures — nodes, matrices, arrays — become regular ar- 
rays of fixed size, and the INSTRUCT structure can be 
very efficiently implemented as a 4-dimensional bit array 
where the bits can be directly accessed and the bit opera- 
tions easily performed. 

When a particular bit, at say, node $c_1$, row $c_2$, column
$c_3$, and position w is set, it indicates that there exists a
key in the database with the triplet $c_1 c_2 c_3$ at position w.
Figure 1 shows the details of a matrix and a cell where
k = 4 and l = 5. The INSTRUCT structure can be viewed
as a hash table of triplets with position information.
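
The addressing described above can be sketched as a flat bit array. This is only an illustration under stated assumptions, not the authors' implementation: the class name `PositionBits`, its method names, and the row-major layout are hypothetical choices, and positions are 1-indexed as in the text.

```python
class PositionBits:
    """Flat bit array addressed by (c1, c2, c3, w); a hypothetical sketch."""

    def __init__(self, alphabet, max_len):
        self.code = {ch: i for i, ch in enumerate(alphabet)}
        self.k = len(alphabet)
        self.l = max_len
        n_bits = self.k ** 3 * self.l          # one bit per (triplet, position)
        self.bits = bytearray((n_bits + 7) // 8)

    def _offset(self, triplet, w):
        c1, c2, c3 = (self.code[ch] for ch in triplet)
        # Assumed row-major layout: node, row, column, then position.
        return ((c1 * self.k + c2) * self.k + c3) * self.l + (w - 1)

    def set(self, triplet, w):
        byte, bit = divmod(self._offset(triplet, w), 8)
        self.bits[byte] |= 1 << bit

    def test(self, triplet, w):
        byte, bit = divmod(self._offset(triplet, w), 8)
        return bool(self.bits[byte] >> bit & 1)


position = PositionBits("ABCD", max_len=5)     # k = 4, l = 5 as in Figure 1
position.set("ABC", 1)                          # triplet 'ABC' seen at position 1
```

A mark array could be held in a second instance of the same class, which is where the factor of two in the space requirement comes from.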

However, the INSTRUCT structure by itself
is not enough to disambiguate between all the keys in a
database. To explain this, consider the following situation.
Suppose only the keys 'ABCA' and 'DBCD' are present
in a database. A search on the key 'ABCD' will now
be successful as all triplets of 'ABCD', i.e., both 'ABC'
and 'BCD', are marked in INSTRUCT, and at the correct
positions, too! The problem is that since only triplets are
indexed, the history regarding the original string of which
the triplet was a part gets lost.

Figure 1: Internal structure of a matrix and a cell.

1 The longest non-technical word in English is
floccinaucinihilipilification
(http://en.wikipedia.org/wiki/Longest_word_in_English).

2 Only l - 2 positions are needed, as there can be a maximum of l - 2
triplets from a key of length l. However, we ignore this to simplify the
discussion.

To alleviate the problem, INSTRUCT utilizes another 
l-element bit array called mark in each cell, similar to the 
position array. A bit in the mark array gets set for a triplet 
only when it is the last triplet in a key. Figure 1 shows 
how the mark array is maintained inside a cell. When a 
mark bit is set, a container is allocated that stores all keys 
that end with the triplet at the position corresponding to 
the mark element. The container may be a lexicographi- 
cally ordered list or a tree-based structure. We discuss the 
choice of container later. For the above example, the container
for 'BCD' will only include the key 'DBCD', and
therefore, a search for 'ABCD' will fail. The containers
may also be stored in the disk, if necessary, and pointers 
to them are maintained within INSTRUCT. For search- 
ing and insertion, only the required container needs to be 
brought into memory. 

For non-string databases, INSTRUCT can be used to 
index the primary keys, while the pointers will be to the 
buckets containing the complete data stored on disk. 

The total space requirement of INSTRUCT is, thus,
only $2k^3 l$ bits in addition to the actual keys (and associated
objects). For the English dictionary, this translates
to only 125 kB. It is interesting to observe that for a given
value of k and l, all possible permutations of characters up
to length l (i.e., $k + k^2 + \cdots + k^l = O(k^{l+1})$) can be
represented in INSTRUCT with the same memory requirement.
This feature is quite novel, and makes INSTRUCT
extremely robust and space-efficient as compared to other
structures. Further, bit implementation allows simple bit
operations such as AND, RIGHT SHIFT, etc. in the algorithms
for searching and insertion (presented in Section 4),
thereby making them extremely efficient.
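
As a quick check of the figure quoted above, using the values k = 26 and l = 29 assumed earlier for an English dictionary:

$$
2k^3 l = 2 \cdot 26^3 \cdot 29 = 1{,}019{,}408 \text{ bits} = 127{,}426 \text{ bytes} \approx 124.4 \text{ KiB},
$$

which is consistent with the roughly 125 kB stated in the text.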

For extreme pathological cases, where the database is 
so huge that even this index cannot be accommodated in 
the main memory, the individual nodes of INSTRUCT can
be easily stored in the disk, as they are independently pro- 
cessed for the different triplets. The nodes (and corre- 
sponding containers) can be dynamically loaded. Using 
various caching and paging policies, the performance in 
such situations can be quite efficient. We do not assume 
such cases in this paper. 

4 Algorithms 
Figure 2: Insertion of key 'ACAD': (a) first triplet
('ACA') and (b) last triplet ('CAD').



4.1 Insertion 

The insertion procedure into INSTRUCT is based on re- 
peatedly setting the correct position bits based on all the 
triplets present in the key. For the triplet $c_1 c_2 c_3$ at position
w in the key, the bit in the position array indexed by node
$c_1$, row $c_2$, column $c_3$ and position w is set. If this is the
last triplet of the key, i.e., $c_3$ is the last character, then the
corresponding mark bit is also set. If there is a container
already pointed to by the bit (as there may be other keys
in the database ending with $c_1 c_2 c_3$ at w), the new key is
inserted into the container. If there is no such container, 
a new one is allocated and the key is inserted. The set- 
ting of the bits can be efficiently implemented using bit- 
wise operators with appropriate bit masks. Without loss 
of generality, we consider that unique keys are inserted 
into INSTRUCT as primary keys are never duplicated. In 
the situation where keys may be duplicated, the containers 
will be implemented as a tree-based structure, and the in- 
sertion procedure will be replaced by a search-and-insert 
procedure where a key is searched initially, and is inserted 
only if it is absent. 

For keys of size 1 and 2, we maintain a special container,
the size of which is bounded by $k + k^2$. This handles
the boundary conditions where no proper triplet can
be formed.

Consider inserting the key 'ACAD'. The first triplet
is 'ACA'. Following the algorithm, position[A][C][A][1]
is set (Figure 2(a)). In the next step, both
position[C][A][D][2] and mark[C][A][D][2] are set
(Figure 2(b)). Since the key has ended, a container is allocated.
All keys of the form '?CAD' are indexed in this
container, where '?' stands for any character. As a further
space optimization, since the last triplet, i.e., 'CAD',
is common to all keys in the container, only the rest, i.e.,
'A', is stored.

Inserting a key of length n requires setting n - 2 bits
corresponding to the triplets in the key. Since array addressing
takes constant time, the time taken in this phase
is O(n). After the mark bit is set, the key is inserted into
the container. Thus, the total time to insert a key is O(n)
+ (time to insert in container). The latter time depends on
the nature of the container as well as its size. If the container
is a list, e.g., a linked list or a dynamic array, insertion
can be achieved in O(1) time. If, on the other hand,
the container is organized as a tree structure, e.g., a balanced
binary search tree (BST), insertion takes O(log s)
time where s is the size of the container.
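
The insertion procedure can be sketched as follows. This is a minimal illustration under several assumptions that are not from the paper: the class name `Instruct` and its attribute names are hypothetical, bit vectors are held sparsely as Python integers (bit w stands for position w) rather than in a dense 4-dimensional array, and containers are plain sets.

```python
from collections import defaultdict


class Instruct:
    """Sparse sketch: each bit vector is a Python int, bit w <-> position w."""

    def __init__(self):
        self.position = defaultdict(int)    # triplet -> position bit vector
        self.mark = defaultdict(int)        # triplet -> mark bit vector
        self.containers = defaultdict(set)  # (triplet, w) -> key remainders
        self.short_keys = set()             # special container, lengths 1 and 2

    def insert(self, key):
        n = len(key)
        if n < 3:                           # no proper triplet can be formed
            self.short_keys.add(key)
            return
        for w in range(1, n - 1):           # triplet positions 1 .. n-2
            self.position[key[w - 1:w + 2]] |= 1 << w
        last, w_last = key[-3:], n - 2
        self.mark[last] |= 1 << w_last
        # The last triplet is implied by the mark bit, so store only the rest.
        self.containers[(last, w_last)].add(key[:-3])


index = Instruct()
index.insert("ACAD")
```

After this call, the position bits for 'ACA' at 1 and 'CAD' at 2 are set, the mark bit for 'CAD' at 2 is set, and the container for ('CAD', 2) holds the remainder 'A', mirroring the worked example above.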

4.2 Searching 

Searching a key in INSTRUCT follows the same proce- 
dure as insertion. For every triplet in the key, the cor- 
responding bit at the particular position is checked (again 
we use masks and bit-wise operators for this purpose). For 
the final triplet, the mark bit is also checked. If any such 
bit is not set, then the key cannot be in the database, and 
the search is terminated. So, there are no false negatives. 

However, even if all such bits are set, the container 
pointed to by the mark bit needs to be searched, as the 
bits may be set due to the presence of the key (success- 
ful search) or may be due to the presence of other keys in 
the database that together happen to contain all the triplets 
at the right positions (unsuccessful search). Thus, a sub- 
sequent search in the container is required to resolve be- 
tween the two cases. In Section 4.3, we estimate the prob- 
ability of such a false positive. 

Consequently, in the worst case, the time for searching
a key of length n is O(n) + (time to search in container).
If the container is a linked list of size s, the latter time is
O(s); if it is a BST, the time is O(log s).

Figure 3 shows the snapshot of an INSTRUCT structure
storing the keys 'ABCDA', 'ADCDB', 'CCDA', and 'BCDAAD'.
Assume that the key 'ADCDB' is queried. For
the first triplet 'ADC', we obtain the position bit vector
from the corresponding node. It must contain a set bit at
the first position. Since that is the case here, the position
vector for the next triplet 'DCD' is checked, which has
the second bit set. Moving forward, for the last
triplet 'CDB', both the position and the mark vectors
contain a set bit at the third position. Thus, the container
corresponding to the mark bit is searched. It should be
noted that the last triplet is not stored in the containers as
it can be obtained from the position of the mark bit. Consequently,
the string 'ADCDB' is reported as present.

Figure 3: Example of INSTRUCT storing the keys
'ABCDA', 'ADCDB', 'CCDA', and 'BCDAAD'.

If the key 'DCDB' is queried, the searching stops at 
the first step since the position vector for 'DCD' does not 
contain a set bit at position 1. A more interesting case 
is for the key 'ADCDA'. All the triplets have the position
bits correctly set and the container corresponding to the 
final triplet 'CDA' is searched. This is an example of a 
false positive search using the INSTRUCT index only, as 
finally the container search returns a negative answer. 

The searching algorithm can also follow another strat- 
egy. Only the mark bit corresponding to the last triplet is 
examined. If it is not set, the search fails. Otherwise, the 
container is directly searched without checking the bits for 
the other triplets. This avoids traversing the length of the 
search key (i.e., the O(n) time in the total cost). However,
the chance that an unsuccessful search is terminated early 
is eliminated. On the other hand, for a successful search, 
this is always a better strategy. We call this the direct 
search strategy as opposed to the index search strategy 
otherwise. 
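
The two strategies can be contrasted in a short sketch. It is illustrative only: the class and method names (`Instruct`, `search_index`, `search_direct`) and the `container_lookups` counter are hypothetical, bit vectors are Python integers with bit w standing for position w, and keys shorter than three characters are assumed not to occur.

```python
class Instruct:
    def __init__(self):
        self.position = {}           # triplet -> position bit vector
        self.mark = {}               # triplet -> mark bit vector
        self.containers = {}         # (triplet, w) -> set of key remainders
        self.container_lookups = 0   # counts how often a container is searched

    def insert(self, key):           # assumes len(key) >= 3
        n = len(key)
        for w in range(1, n - 1):
            t = key[w - 1:w + 2]
            self.position[t] = self.position.get(t, 0) | 1 << w
        last = key[-3:]
        self.mark[last] = self.mark.get(last, 0) | 1 << (n - 2)
        self.containers.setdefault((last, n - 2), set()).add(key[:-3])

    def _in_container(self, key):
        n = len(key)
        if not self.mark.get(key[-3:], 0) >> (n - 2) & 1:
            return False             # mark bit not set: no key ends this way
        self.container_lookups += 1
        return key[:-3] in self.containers[(key[-3:], n - 2)]

    def search_index(self, key):
        """Index search: check every position bit, then the container."""
        for w in range(1, len(key) - 1):
            if not self.position.get(key[w - 1:w + 2], 0) >> w & 1:
                return False         # pruned without touching a container
        return self._in_container(key)

    def search_direct(self, key):
        """Direct search: check only the mark bit, then the container."""
        return self._in_container(key)


index = Instruct()
for key in ("ABCDA", "ADCDB", "CCDA", "BCDAAD"):
    index.insert(key)
```

With the keys of Figure 3, `search_index("DCDB")` is pruned at the first triplet, while `search_index("ADCDA")` passes every bit check and is rejected only by the container, which is the false positive case discussed above.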

4.3 Analysis of searching 

We now analyze the chance of an unsuccessful search be- 
ing terminated early, and use that to devise the optimum 
search strategy. An unsuccessful search key of length n
will be searched in a container if and only if for every
triplet and position the key generates, the corresponding
position bits are set, i.e., for every triplet $c_1 c_2 c_3$ at position
w, there is another key in the database with the same
triplet $c_1 c_2 c_3$ at the same position w.

Since not all keys may be of length w, we denote the
number of keys in the database having a length of at least
w by f(w) and the probability that at least 1 out of m keys
in the database contains character $c_1$ at position w by $P_w$.
Assuming all the characters to be equi-probable, i.e., the
probability of occurrence of a character at any particular
position is 1/k, we get,

$$
\begin{aligned}
P_w &= 1 - P(\text{no key contains } c_1) \\
    &= 1 - \bigl(P(\text{key contains character other than } c_1)\bigr)^{f(w)} \\
    &= 1 - (1 - 1/k)^{f(w)}
\end{aligned}
\tag{1}
$$

The probability that a triplet appears at the position w
is then the product of the three individual probabilities
(since the corresponding events are independent):

$$
\begin{aligned}
P_{w,3} &= \bigl(1 - (1 - 1/k)^{f(w)}\bigr) \cdot \bigl(1 - (1 - 1/k)^{f(w+1)}\bigr) \cdot \bigl(1 - (1 - 1/k)^{f(w+2)}\bigr) \\
        &\approx 1 - \sum_{i=w}^{w+2} (1 - 1/k)^{f(i)} \quad \text{[ignoring higher order terms]}
\end{aligned}
\tag{2}
$$

Eq. (2) provides a way to compute the probability of all
n - 2 triplets appearing at positions 1, ..., n - 2. The last
triplet, however, must also be the last triplet in some other
key of the same length. Denoting the number of database
keys that have a length of exactly w by g(w), Eq. (1) can
be modified as:

$$
P_{w_e} = 1 - (1 - 1/k)^{g(w)}
\tag{3}
$$

Consequently, Eq. (2) can be modified to:

$$
P_{w_e,3} \approx 1 - \sum_{i=w}^{w+1} (1 - 1/k)^{f(i)} - (1 - 1/k)^{g(w+2)}
\tag{4}
$$

The occurrence of two consecutive triplets is not independent
as they share two characters. However, to simplify
the calculations, we assume that the events are independent.
With this assumption, the probability $P_n$ that
all the triplets of the search key of length n are present in
the database can be estimated as

$$
\begin{aligned}
P_n &= \prod_{j=1}^{n-3} P_{j,3} \cdot P_{(n-2)_e,3} \\
    &= \prod_{j=1}^{n-3} \Bigl(1 - \sum_{i=j}^{j+2} (1 - 1/k)^{f(i)}\Bigr) \cdot
       \Bigl(1 - \sum_{i=n-2}^{n-1} (1 - 1/k)^{f(i)} - (1 - 1/k)^{g(n)}\Bigr) \\
    &\approx 1 - \sum_{j=1}^{n-2} \sum_{i=j}^{j+2} (1 - 1/k)^{f(i)} + (1 - 1/k)^{f(n)} - (1 - 1/k)^{g(n)}
       \quad \text{[ignoring higher order terms]}
\end{aligned}
\tag{5}
$$

Since each of the f(i) and g(i) terms is bounded by
m, $P_n$ can be upper bounded as follows:

$$
P_n \le 1 - \sum_{j=1}^{n-2} \sum_{i=j}^{j+2} (1 - 1/k)^{m} = 1 - 3(n - 2)(1 - 1/k)^{m}
\tag{6}
$$

Eq. (6) can be used to determine the optimal search
strategy. Assume that searching for a key through INSTRUCT
takes $T_s$ time and that through a container takes
$T_c$ time. For an unsuccessful key of length n, the search
is terminated using the INSTRUCT index structure with
probability $(1 - P_n)$. Otherwise, with probability $P_n$, the
container is searched as well. Thus, the expected searching
time for this index search strategy is

$$
T_i = (1 - P_n)\,T_s + P_n\,(T_s + T_c)
\tag{7}
$$

The alternate direct search strategy first checks whether
the mark bit is set for the last (i.e., (n - 2)th) triplet, and
only if so, searches the associated container. The expected
time, thus, is

$$
T_d = P_{(n-2)_e,3}\,T_c
\tag{8}
$$

Thus, it is beneficial to search through INSTRUCT
when

$$
T_i < T_d, \quad \text{or,} \quad T_s < \bigl(P_{(n-2)_e,3} - P_n\bigr)\,T_c
\tag{9}
$$

Using Eq. (6) and replacing f(i), g(i), etc. in Eq. (4)
by m,

$$
T_s / T_c < 3(n - 3)(1 - 1/k)^{m}
\tag{10}
$$



When the length of an unsuccessful search key, n, increases,
the probability of the search being pruned by INSTRUCT
increases, as it is less likely that all the triplets
will be present at precisely the right positions. On the
other hand, when the number of keys, m, is very large, due
to the large number of triplets, it becomes more likely that
there exists a triplet in the database at a particular position.
As a result, searching through INSTRUCT wastes time as
there will be little pruning. The size of the alphabet, k, has
an opposing effect. When the number of possible characters
increases, it is less likely that a triplet will be repeated
in the database, thereby making the chance of pruning an
unsuccessful search higher. Eq. (10) confirms these behaviors.
Section 5 experimentally establishes them.
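
The trends described above can be checked numerically by evaluating the bound of Eq. (10), read here as 3(n - 3)(1 - 1/k)^m. The function names and the sample parameter values below are illustrative assumptions, not taken from the paper's experiments.

```python
def index_search_threshold(n, k, m):
    """Right-hand side of Eq. (10): 3 (n - 3) (1 - 1/k)^m."""
    return 3 * (n - 3) * (1 - 1 / k) ** m


def prefer_index_search(t_s, t_c, n, k, m):
    """True when the ratio T_s / T_c falls below the bound of Eq. (10)."""
    return t_s / t_c < index_search_threshold(n, k, m)


# Hypothetical settings: key length 10, alphabet of 26 characters.
few_keys = index_search_threshold(n=10, k=26, m=100)
many_keys = index_search_threshold(n=10, k=26, m=1000)
```

Under these sample values the threshold shrinks sharply as m grows, so index search pays off only when checking the bits is very cheap relative to a container search; a longer key or a larger alphabet raises the threshold.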

4.4 Suffix Searching 

The suffix search procedure is almost the same as the ex- 
act key search, except for one crucial difference. For an 
exact string search, since the length of the search key is 
known, only the particular position bit is checked in the 
mark array corresponding to the last triplet of the key. A 
suffix, on the other hand, can end at any length and one 
particular mark bit cannot be checked. If, however, the 
lengths are known, then the suffix can be easily searched 
by iterating over all such possible lengths. The trick, 
therefore, is figuring out these lengths efficiently. 

Suppose the query suffix is $c_1 c_2 \cdots c_f$. For the last
triplet, i.e., $c_{f-2} c_{f-1} c_f$, we check at what positions it
ends in the mark array. If there is a mark bit set at position
p, it means that there exists a key in the database
that ends at position p with the triplet $c_{f-2} c_{f-1} c_f$. We
next check the previous triplet $c_{f-3} c_{f-2} c_{f-1}$ in the position
array. If a key contains both the triplets, then the
position of the last triplet must be exactly one more than
the position of the last but one triplet. Thus, for every
set bit at position p in the mark array, if there is no set
bit at position p - 1 in the position array of the previous
triplet, there cannot
be a key ending at position p containing both the triplets
$c_{f-2} c_{f-1} c_f$ and $c_{f-3} c_{f-2} c_{f-1}$. Hence, the query cannot
be a suffix ending at position p, and the position p can be
removed from the list of possible positions. We continue
in this fashion for all the triplets in the suffix. For all the
mark bits that survive this pruning, we do a search in the
corresponding containers.

For efficiency purposes, the above operations are per- 
formed using bit vectors. The mark and position arrays 
are all bit vectors. To obtain all the p — 1 positions from 
the mark vector, it is RIGHT SHIFT-ed by one bit. The 
resulting vector is then AND-ed with the position vector 
of the previous triplet to obtain the new list of positions. 
The RIGHT SHIFT and AND operations are done at most
f - 2 times for a suffix of length f.

Consider searching the suffix 'BCDA' in the INSTRUCT
structure shown in Figure 3. The mark vector
V in the node corresponding to the last triplet 'CDA' encodes
the probable ending positions for strings with the
queried suffix. The previous triplet, 'BCD', is next considered.
Its position vector is RIGHT SHIFT-ed by one
position and is AND-ed with V, leaving the 2nd and 3rd
bits of V set. The containers attached to the last triplet 'CDA'
at these positions are finally searched to return the string
'ABCDA'.

Searching an unsuccessful suffix such as 'ACDA' produces
an empty V vector as there is no 'ACD' triplet in the
database. Consequently, we directly report that there are
no strings with the queried suffix. If the suffix 'DCDA' is
queried, only the 3rd bit of V is set and the corresponding
container is searched. Once more, this is an example of a
false positive, as no key with the queried suffix is found.

We now analyze the time complexity of this procedure.
In the worst case, every mark bit is set and none of them
gets pruned by the subsequent operations. For a suffix of
length f, the complexity of performing the list operations
is O(fl), where l is the maximum length of a key. Finally,
all O(l) containers are searched. Hence, the total time for
suffix search is O(fl) + O(l) × T, where T is the average
time for searching a container.
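
The shift-and-AND pruning can be sketched as below. The sketch makes assumptions beyond the paper: names such as `suffix_candidates` are hypothetical, bit vectors are Python integers in which bit p stands for position p (so the shift that aligns the previous triplet with the last one is written `<<` here; the direction name depends on how the bits are laid out), and suffixes shorter than three characters are not handled.

```python
from collections import defaultdict


class Instruct:
    def __init__(self):
        self.position = defaultdict(int)     # triplet -> position bit vector
        self.mark = defaultdict(int)         # triplet -> mark bit vector
        self.containers = defaultdict(set)   # (triplet, w) -> key remainders

    def insert(self, key):                   # assumes len(key) >= 3
        n = len(key)
        for w in range(1, n - 1):
            self.position[key[w - 1:w + 2]] |= 1 << w
        self.mark[key[-3:]] |= 1 << (n - 2)
        self.containers[(key[-3:], n - 2)].add(key[:-3])

    def suffix_candidates(self, suffix):     # assumes len(suffix) >= 3
        f = len(suffix)
        v = self.mark.get(suffix[-3:], 0)    # positions where a key may end
        for t in range(1, f - 2):            # t-th triplet before the last
            prev = suffix[f - 3 - t:f - t]
            # Align "prev at p - t" with "last at p", then prune with AND.
            v &= self.position.get(prev, 0) << t
        return [p for p in range(1, v.bit_length()) if v >> p & 1]

    def suffix_search(self, suffix):
        last = suffix[-3:]
        hits = []
        for p in self.suffix_candidates(suffix):
            for rest in self.containers[(last, p)]:
                if (rest + last).endswith(suffix):   # resolve false positives
                    hits.append(rest + last)
        return sorted(hits)


index = Instruct()
for key in ("ABCDA", "ADCDB", "CCDA", "BCDAAD"):
    index.insert(key)
```

For the keys of Figure 3, the candidate positions for 'BCDA' are 2 and 3, 'ACDA' yields no candidates, and 'DCDA' yields position 3 but an empty result, matching the examples in the text.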

4.5 Prefix Searching 

The prefix searching method exploits the fact that a pre- 
fix of a key is a suffix of the reverse of the key. Hence, 
we maintain a separate INSTRUCT structure where the 
reverse of every key in the database is inserted. A prefix 
search in the original space translates to a suffix search on 
the reverse INSTRUCT structure. This strategy, however, 
doubles the space requirements of INSTRUCT. 

4.6 Substring Searching 

A substring can be efficiently searched in INSTRUCT, 
albeit with an increase in the space requirements. The 
key idea is to note that any substring, when sufficiently 
shifted, becomes a prefix. Thus, if the amount of shifting 
is known, each key in the database can be shifted by that 
amount, and a prefix search can be issued on the shifted 
keys. This is precisely the idea that INSTRUCT uses. 

In addition to the original reverse INSTRUCT structure,
we maintain l - 1 extra reverse structures, $S_i$, i =
1, ..., l - 1, where l is the maximum length of a key.

Figure 4 shows the first reverse structure $S_1$ corresponding
to the keys in Figure 3. When a key of length n
is inserted into INSTRUCT and its reverse is inserted into
reverse INSTRUCT, n - 1 strings are extracted from the
key in addition, by shifting one character at a time. The resulting
strings are inserted into the corresponding reverse
INSTRUCT structures. Suppose a key $K = c_1 c_2 \ldots c_n$ is
inserted. We create n - 1 strings from the key. The ith
string $K_i = c_{i+1} c_{i+2} \cdots c_n$ is inserted into $S_i$. Although
only a part of the key, i.e., $K_i$, is used to index in $S_i$, the
containers of $S_i$ store the entire key K. This is done to
ensure that the original keys can be returned from $S_i$ after
a successful search.

Figure 4: Example of extra reverse INSTRUCT structure
for substring search.

The algorithm for substring search uses a similar strategy
as the suffix search. When a substring $c_1 c_2 \cdots c_r$ of
length r is queried, first, the positions where the last triplet
$c_{r-2} c_{r-1} c_r$ is present are found by using the position
array corresponding to the triplet. Note that this deviates
from the suffix search as the position vector, and not the
mark vector, needs to be searched, since a key may not
necessarily end with the substring. The triplets are then
traversed backwards and all possible positions where the
substring can start are found. Suppose the list of these positions
is L. For every position p ∈ L, a prefix search with
the substring is performed at the structure $S_{p-1}$, since a
substring starting at position p is a prefix of $K_{p-1}$; here
$S_0$ denotes the original reverse INSTRUCT. Together, the
results of searching the various structures provide
the entire set of database keys containing the substring.

Consider a substring query for 'BCDA' in the INSTRUCT
structure shown in Figure 3. The position bit
vector V of the last triplet 'CDA' includes all possible
positions where the substring can end in a key. Next, V
is LEFT SHIFT-ed by one bit and bit-wise AND-ed
with the position vector of the previous substring triplet,
i.e., 'BCD'. The bits at positions 1 and 2 of V are set in
this process. This implies that the substring can start only
at positions 1 and 2 in a key. Hence, a prefix search
with 'BCDA' is issued in the two reverse INSTRUCT
structures corresponding to these positions, i.e., the (original)
reverse INSTRUCT and the one-shifted reverse INSTRUCT
$S_1$. The prefix searches in the two structures
generate 'ABCDA' and 'BCDAAD' as the result.

Next, consider an unsuccessful substring search on 
'CCDD'. Since there is no such triplet 'CDD' in the 
database, the search can be immediately terminated with- 
out accessing any of the reverse structures. This pro- 
vides a substantial advantage over other brute-force or 
trie-based methods. 

The chances of false positives, however, remain. For
example, consider the substring 'DCDA'. The position
vector of 'CDA', when LEFT SHIFT-ed and AND-ed
with the position vector of 'DCD', yields position 2 as a
possibility where the substring can occur in a key. Thus,
a prefix search in the one-shifted reverse structure $S_1$ is
issued. However, only an empty result set is returned.
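
The three examples above can be traced with a small Python sketch.
It is only an approximation under stated assumptions: the three-key
database is made up for the example (it is not the content of
Figure 3), position vectors are modeled as Python integers with bit p
standing for 0-based offset p (so offset 0 is "position 1" in the
text), and the prefix search in the shifted structure is replaced by a
plain `str.startswith` check. Because bit 0 is the first position here,
the LEFT SHIFT of the text becomes a right shift (`>> 1`).

```python
from collections import defaultdict


def build_position_vectors(keys):
    """Map each triplet to an int whose bit p is set if the triplet
    starts at 0-based offset p in at least one key."""
    pos = defaultdict(int)
    for key in keys:
        for p in range(len(key) - 2):
            pos[key[p:p + 3]] |= 1 << p
    return pos


def candidate_starts(pos, query):
    """Return the 0-based offsets at which `query` may start."""
    if len(query) < 3:
        raise ValueError("query must contain at least one triplet")
    last = len(query) - 3
    v = pos.get(query[last:], 0)           # offsets of the last triplet
    for t in range(last - 1, -1, -1):      # walk the triplets backwards
        v = (v >> 1) & pos.get(query[t:t + 3], 0)
        if v == 0:                         # pruned: no possible start
            break
    return [p for p in range(v.bit_length()) if (v >> p) & 1]


def substring_search(keys, pos, query):
    """Verify each candidate offset; startswith() stands in for the
    prefix search in the corresponding shifted structure."""
    hits = set()
    for p in candidate_starts(pos, query):
        hits.update(k for k in keys if k.startswith(query, p))
    return sorted(hits)


keys = ["ABCDA", "BCDAAD", "ADCDB"]        # assumed toy database
pos = build_position_vectors(keys)
print(candidate_starts(pos, "BCDA"), substring_search(keys, pos, "BCDA"))
print(candidate_starts(pos, "CCDD"))       # no 'CDD' triplet: pruned
print(candidate_starts(pos, "DCDA"), substring_search(keys, pos, "DCDA"))
```

With this toy database, 'DCDA' survives the bit-vector filter at one
offset but the verification step returns nothing, which is the false
positive case described above.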

Storing the extra INSTRUCT structures increases the
total space complexity to $2k^3 l^2$ bits. For the English
dictionary mentioned in Section 3, this evaluates to 3.5 MB.
If there is not enough space in the memory to store all the 
reverse structures, the extra ones are stored on disk. These 
extra structures are invoked only for a substring search, 
and only if the corresponding offset is in the possible list 
of positions. As the extra INSTRUCT structures are in- 
dependent, the prefix searches in the different structures 
can be performed in parallel. The experiments reported in 
Section 5, however, do not use parallelization. 

In a sequential machine, the time for substring search
is determined by the number of prefix searches and the
time for each of them. So, the total time complexity for
searching a substring of length $r$ is
$(O(lr) + O(1) \times T) \times$ (the number of prefix search
positions found), where $T$ is
the average time to search a container. In the next section,
we calculate the expected number of such prefix searches.

4.7 Analysis of Prefix, Suffix, and Substring 
Searching 

The search procedures guarantee correct results by finally 
searching the containers that have a possibility of contain- 
ing an answer. An unsuccessful search may be generated 
if all the triplets present in the query are also present at 
the same position in other keys of the database. We now 
analyze the searching of suffixes, prefixes and substrings. 

Eq. (6) shows the probability that a particular string
of length $n$ is searched in a container. The probability
$P_{prefix}$ that the entire prefix of length $s$ is matched, and
an unsuccessful search is generated, can be deduced
similarly:

$P_{prefix} < 1 - 3(s - 2)(1 - 1/k)^m$    (11)

Note that here we are ignoring the positions where a
prefix can start as we have bounded the number of keys at
a position by its worst case, which is the total number of
keys $m$. In reality, $P_{prefix}$ is much less. The probability
$P_{suffix}$ that the entire length of a suffix of length $l$
is matched, and an unsuccessful search is generated, is the
same when the $f(i)$ and $g(i)$ terms are bounded by $m$.

The above equation also provides an upper bound of 
the probability that a search for a substring of length s is 
issued when it is not present in the database. We use this 
bound to analyze the substring searching. 

The substring search is actually a series of prefix 
searches. Each such search has an analysis as given by 
Eq. (11). The expected number of prefix searches that 
will be issued in the different INSTRUCT structures for a 
substring search is equal to the expected number of posi- 
tions in the final list after all the triplets of the substring 
have been traversed. 

We assume the event that the substring of length s oc- 
curs at position i to be independent of the event that the 
substring occurs at some other position j. Again, this is 
a simplification, as for long substrings or for short differ- 
ences in i and j, the events are not independent. Modeling 
the occurrence of the substring by binomial trials, the ex- 
pected number of positions where the substring occurs is 
given by the product of the total number of trials and the 
probability of success in each trial. The total number of 
trials is $l$ as there can be $l$ positions. The probability of
success in each trial (i.e., position) is given by Eq. (11).
The expected number of prefix searches is then

$l \times P_{prefix} < l \times (1 - 3(s - 2)(1 - 1/k)^m)$    (12)

When the largest length of a key, $l$, increases, the
chance that a prefix search needs to be issued also
increases. When the number of keys, $m$, increases, it
becomes more likely that a key in the database will have
the queried substring, thereby increasing the number of
searches. The length of the substring queried, $s$, has an
opposing effect as more triplets need to be present before
a search is issued in the container. Finally, when the size
of the alphabet, $k$, increases, the chance that a particular
triplet occurs decreases since the probability of a
character matching with another decreases.
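
These trends can be checked numerically. The sketch below evaluates the
right-hand side of Eq. (12), assuming it reads
l x (1 - 3(s - 2)(1 - 1/k)^m). The function name and the clamping of
the probability bound to [0, 1] are additions for this example; the raw
expression can fall outside that range (for instance, for very small
m), where the bound carries no information.

```python
def expected_prefix_searches(l, s, k, m):
    """Upper bound on the expected number of prefix searches for a
    substring of length s, with longest key length l, alphabet size k
    and m keys (assumed form of Eq. (12), clamped for illustration)."""
    if s < 3:
        raise ValueError("a substring needs at least one triplet (s >= 3)")
    p = 1.0 - 3 * (s - 2) * (1.0 - 1.0 / k) ** m
    p = min(1.0, max(0.0, p))  # keep the probability bound in [0, 1]
    return l * p


base = expected_prefix_searches(l=10, s=5, k=4, m=10)
print(base)
print(expected_prefix_searches(l=20, s=5, k=4, m=10))  # larger l
print(expected_prefix_searches(l=10, s=5, k=4, m=20))  # larger m
print(expected_prefix_searches(l=10, s=6, k=4, m=10))  # larger s
print(expected_prefix_searches(l=10, s=5, k=5, m=10))  # larger k
```

In this small parameter range, increasing l or m raises the bound and
increasing s or k lowers it, in line with the discussion above.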

4.8 Deletion, Updating, and Re-insertion 

When a key is to be deleted from INSTRUCT, it is first
searched. If it is found, the deletion operation in the
container is performed. The corresponding mark bit is reset
to 0 only when the container becomes empty; in that case, the
container is also de-allocated. Updating a key involves
deleting the key and then inserting the modified key, while
re-insertion follows the same procedure as insertion. The
time complexities of these procedures are bounded by
those of insertion and searching.
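
The deletion rule can be sketched on a toy model. This is not the
layout used by INSTRUCT: the class `ToyInstruct`, the choice of keying
containers by (last triplet, 0-based offset), and the use of Python
sets and dictionaries are all assumptions made for illustration. The
presence of a container entry plays the role of a set mark bit.

```python
class ToyInstruct:
    """Toy model (assumed, simplified): position bits per triplet and
    one container per (last triplet, offset of that triplet)."""

    def __init__(self):
        self.position = {}    # triplet -> int bit vector of offsets
        self.containers = {}  # (triplet, offset) -> set of keys

    def insert(self, key):
        if len(key) < 3:
            raise ValueError("key must contain at least one triplet")
        for p in range(len(key) - 2):
            t = key[p:p + 3]
            self.position[t] = self.position.get(t, 0) | (1 << p)
        slot = (key[-3:], len(key) - 3)
        self.containers.setdefault(slot, set()).add(key)

    def delete(self, key):
        """Search first; remove the key; drop the container (mark bit
        reset) only if it becomes empty. Position bits are left set,
        since other keys may share them."""
        slot = (key[-3:], len(key) - 3)
        container = self.containers.get(slot)
        if container is None or key not in container:
            return False               # key not found, nothing to do
        container.remove(key)
        if not container:
            del self.containers[slot]  # mark bit -> 0, de-allocate
        return True


t = ToyInstruct()
t.insert("ABCDA")
t.insert("XBCDA")
t.delete("ABCDA")                      # container still holds 'XBCDA'
print(("CDA", 2) in t.containers)
t.delete("XBCDA")                      # container empty, entry removed
print(("CDA", 2) in t.containers)
```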

The mark vector and the position vectors remain filled
up after repeated deletions and insertions. This poses
a problem for searching as the pruning capacity of
INSTRUCT decreases. However, unlike the position bits, a
mark bit can be reset to 0 if the corresponding container
becomes empty due to deletions. In any case, the time
for searching in the container decreases even though the
mark bit remains set (if the container does not become
empty). Further, most string-based applications perform
many more insertion and search operations than deletions,
thereby rendering this a not-so-critical issue.

5 Experiments 

In order to assess the performance of INSTRUCT, we con- 
ducted tests on multiple datasets and compared it with two 
other structures, burst trie [19] and compact trie [22, 32]. 
While there exist a number of other structures that support
string operations (see Section 2), the burst trie is reported
to require the least amount of memory [3], while
the compact trie is reported to be the fastest for exact key
searching operations [22, 32]. Hence, we compared
INSTRUCT with these two structures only.

We used two real datasets: (i) an English dictionary (obtained
from http://www.outpost9.com/files/WordLists.html), and
(ii) protein sequences from the RCSB Protein Data Bank
(PDB, http://www.rcsb.org/pdb/). We also used synthetic
datasets to assess the scalability and practicality of our
algorithms. The datasets were uniformly distributed random
data (henceforth referred to as the Uniform dataset) and
Zipfian distributed data (the Zipfian dataset), both with
varying parameters. Section 4.3 assumes a random
distribution while many natural datasets such as the English
dictionary follow the Zipfian distribution.

The containers in INSTRUCT can be organized as a list
or as a BST. These two variants were compared against the
two trie variants, burst trie and compact trie, with respect
to the following parameters: (i) memory size, (ii) insertion
time, and (iii) searching time for both successful and
unsuccessful searches. We also measured empirically the
probability of pruning the false positives during a search,
and we show the results for prefix, suffix, and substring
searches. These experiments were run on a 2.1 GHz desktop
PC with 2 GB of memory using a C++ compiler on a
Linux platform. Due to space constraints, we show only
representative results; the complete results can be
found in [14].

5.1 Real datasets 

Table 1 summarizes the two real datasets. Table 2(a)
shows that the INSTRUCT structures require less storage
space than the other two structures. The main component
of the storage comes from the actual keys themselves,
and thus, the differences are very small. The insertion
and search times are also better. Table 2(b), on the
other hand, shows that the memory requirement of the
INSTRUCT structure becomes very large when the keys
are long. The overhead of maintaining bit vectors
of length 2512 for every cell of the matrix requires
about 10 MB of memory space. However, the insertion
and search times are lower than those for the burst trie.
The pruning offered by indexing makes the search faster.

Since the search performance of INSTRUCT depends 
on the number and size of containers, we measured the 
following additional parameters as well: (i) total number 
of containers, (ii) largest size of a container, and (iii) av- 
erage size of a container. 

The average size of a container shows how well the 
keys are spread. If this number is low, then the keys are 
well-distributed in the containers. Then, even when a con- 
tainer is accessed for a key that is absent in the database, 
the overhead of searching the container is less. In such 
cases, the choice of the list versus BST variants does not 
matter much. 

The other important factor for searching time is the
false positive rate. It is measured as the fraction of
unsuccessful searches, i.e., searches for a key that is not in
the database, for which a container is nevertheless accessed
and searched. Table 1 shows that this ratio is almost
negligible for the dictionary dataset. Thus, the index in the
INSTRUCT structure can efficiently prune most of the
unsuccessful searches without accessing the containers. Even
for the protein dataset, about 84% of the unsuccessful
searches are pruned.

Parameter                   English dictionary   Protein sequences
Number of keys, m           179,935              38,627
Number of symbols, k        26                   21
Longest key length, l       45                   2512
Number of characters        1,198,635            5,846,331
Max. size of a container    601                  205
Avg. size of a container    7.5                  1.3
False positive rate         0.019                0.161

Table 1: Parameters and search performance for the real datasets.

(a) English dictionary

Index         Total      Time to   Searching time
structure     memory     insert    Succ     Unsucc   Total
INS. BST      1.50 MB    1.42 s    0.51 s   0.54 s   1.05 s
INS. List     1.50 MB    1.29 s    0.59 s   0.58 s   1.17 s
Burst tr.     1.53 MB    1.61 s    0.64 s   0.66 s   1.30 s
Compact tr.   2.38 MB    1.82 s    0.65 s   0.65 s   1.31 s

(b) Protein sequences

Index         Total      Time to   Searching time
structure     memory     insert    Succ     Unsucc   Total
INS. BST      15.73 MB   4.89 s    2.28 s   2.21 s   4.49 s
INS. List     15.73 MB   4.66 s    2.44 s   2.16 s   4.60 s
Burst tr.     15.89 MB   5.64 s    2.64 s   2.67 s   5.31 s
Compact tr.   25.71 MB   9.29 s    2.70 s   2.37 s   5.07 s

Table 2: (a) English dictionary results, (b) Protein sequence results.

5.2 Uniform and Zipfian datasets

For synthetic datasets, the important parameters affecting
the performance of the algorithms are: (i) total number
of keys, $m$, (ii) size of the alphabet, $k$, (iii) length of the
longest key, $l$, and (iv) length of the query substring, $n$.

The datasets were generated by controlling these
parameters. The length of each key was chosen randomly
from 1 to $l$, and each character was chosen from a uniform
or a Zipfian distribution of $k$ characters. Two-thirds
of the keys thus generated were inserted in the structure.
The remaining one-third was used to trigger searches that
were unsuccessful. Half of the inserted data (i.e., one-third
of the total generated keys) was used to trigger successful
searches. The prefixes, suffixes and substrings were generated
from the strings stored, starting from random positions
and of varying lengths.
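
A generation procedure of this kind can be sketched in Python as
follows. The function names, the seed, the use of capital letters as
the alphabet, and the Zipfian weights 1/rank (exponent 1) are
assumptions of this sketch; the parameters actually used in the
experiments are not specified here.

```python
import random


def generate_keys(m, k, l, zipfian=False, seed=0):
    """Generate m keys with lengths drawn uniformly from 1..l and
    characters drawn from k symbols (uniform, or Zipfian with assumed
    weights 1/rank)."""
    rng = random.Random(seed)
    alphabet = [chr(ord("A") + i) for i in range(k)]
    weights = [1.0 / (r + 1) for r in range(k)] if zipfian else None
    keys = []
    for _ in range(m):
        n = rng.randint(1, l)
        keys.append("".join(rng.choices(alphabet, weights=weights, k=n)))
    return keys


def split_workload(keys):
    """Two-thirds are inserted, the remaining third drives unsuccessful
    searches, and half of the inserted keys drive successful ones."""
    cut = (2 * len(keys)) // 3
    inserted, absent_queries = keys[:cut], keys[cut:]
    present_queries = inserted[: len(inserted) // 2]
    return inserted, present_queries, absent_queries


keys = generate_keys(m=300, k=5, l=10, zipfian=True)
inserted, present, absent = split_workload(keys)
print(len(inserted), len(present), len(absent))
```

Random generation can produce duplicate keys, so a query from the last
third may by chance match an inserted key; this sketch does not filter
such cases.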

5.3 Effect of number of keys 

With the increase in the number of keys, the size of the dataset
increases. Therefore, the memory requirement increases 
as well. However, the size of the multi-dimensional array 
index structure of INSTRUCT is independent of the num- 
ber of keys. It depends only on the alphabet set size and 
the length of keys. Hence, the growth in memory space 
is at most linear due to the actual key storage in the con- 
tainers. Figure 5(a) shows that INSTRUCT requires the 
least amount of memory and has a better scalability as 
compared to the burst and compact tries. 

Figure 5(b) shows the effect of number of keys on the 
insertion time for Zipfian data. As expected, the scalabil- 
ity is roughly linear for all the structures. As the number 
of keys increases, the average size of each container in- 
creases as well. This explains the widening gap in inser- 
tion times between the two variants of INSTRUCT. The 
burst trie performs the worst due to the nature of the burst 
heuristic. 

Figure 5: Effect of number of keys on (a) memory size
and (b) insertion time.

The next experiment measures the running time for
searching both successful and unsuccessful keys. The
performance of INSTRUCT suffers when a large number of
keys are present in the database (Figure 6(a)). The large
number of false positives with the increase in the size of
the database necessitates more searches in the containers.
The large size of the containers degrades the search
performance. The BST variant performs better than the list
variant due to its superior arrangement of keys in the
container. Keeping the list in lexicographic order would
help in boosting the performance of the list implementation
of the containers.

To analyze the search time for unsuccessful keys of
the direct search strategy versus the index search strategy,
we measured the ratio of the number of searches pruned.
Figure 6(b) compares the pruning ratios of the two
strategies. The pruning for the direct
strategy is almost constant while that for the index
strategy decreases exponentially with the number of keys as
indicated by Eq. (6). The figure also illustrates the fact
that it is prudent to follow the direct search when there
is a large number of keys, thereby reducing the actual
search time: it is then more likely that all the
triplets checked will be in the database and the search
cannot be pruned (the pruning factors of the two strategies
become roughly the same).

Figure 7: Effect of largest key length on (a) memory size
and (b) insertion time.

Figure 9: Effect of alphabet size on (a) memory size and
(b) insertion time.

5.4 Effect of largest key length 

The next set of experiments measures the effect of the key
length on the various algorithms. The number of pointers
in trie-based structures increases with the maximum
length of the keys. Figure 7(a) shows that the increase
in memory size with the largest key length is faster for
the trie-based structures. In the case of INSTRUCT, only
the lengths of the bit vectors increase and, thus, the size
of the whole index increases linearly. However, the memory
requirement is mainly dominated by the actual storage
of the keys, and therefore, the scalability is much better.
Consequently, INSTRUCT requires less memory space
(see [14]).

Inserting a key requires setting the bits corresponding 
to all the triplets in the index; so, the insertion time in- 
creases with the key length (Figure 7(b)). However, since 
trie-based structures invoke pointer chasing whereas IN- 
STRUCT uses direct array access, the insertion procedure 
in INSTRUCT is faster. 



Searching a key with a larger length has two opposing
effects on the running time. On one hand, more
triplets need to be checked in the structure. On the other
hand, Eq. (5) shows that the more triplets there are in
a key, the better is the chance of pruning it, thereby
saving the searching time inside a container. However, for
successful searches, the time to search in the index is
simply an overhead, as the container will have to be searched.
Thus, the total time for searching increases. Nevertheless,
the searching times using INSTRUCT are smaller than those
of the trie structures (Figure 8(a)).

Figure 8(b) shows that the pruning produced by the larger
number of triplets in a longer key makes searching
through the index perform better than the direct search.
The increase in pruning is linear with the length of the
key, as expected from Eq. (6), making the indexed
strategy better for longer keys.

5.5 Effect of alphabet size 

With the increase in the number of characters, the fanout
of the trie-based structures increases. Due to this increase
in the number of pointers, the memory requirement
increases (Figure 9(a)). In INSTRUCT, even though the size
of the index increases cubically, it is only in the order of
bits. Thus, the size of the memory increases only slightly.

For a larger alphabet size, the spread of the keys becomes
better due to fewer collisions. Consequently,
the burst trie undergoes fewer burst
operations, and the total insertion time decreases with
increasing alphabet size. Figure 9(b) shows that the
insertion time for the compact trie, however, increases. The
insertion time for INSTRUCT depends on the length of the
key and the size of the container and is, therefore, mostly
independent of the alphabet size.

Figure 10(a) shows the searching time for different alphabet
sizes. For a small alphabet ($k = 2$), the false positive
rate is practically 1 and the container sizes are extremely
large. As a result, the searching time is large.
When the alphabet size increases, this probability decreases,
thereby reducing the searching time. However,
for large alphabet sizes, the size of the containers increases.
Consequently, after $k = 10$, the structures show an increase
in the searching time.

The probability that a key which is absent in the
database will still be searched in a container is given by
Eq. (5). From the equation, we can see that the larger the
alphabet, the lower the false positive rate. Intuitively,
with more characters to choose from, there is a
smaller chance that the same triplet will be randomly chosen
by a key in the database. Eq. (6) indicates that the
amount of pruning should increase exponentially, and this
is validated by Figure 10(b). Thus, the time for unsuccessful
searches decreases when the alphabet size is increased.
The effect is less prominent for the direct search strategy
as it prunes only on the basis of the last triplet in a key.

5.6 Effect of query length on prefix and suffix search

The first set of experiments measures the successful,
unsuccessful and total search times for query
suffixes of different lengths. When the presence of the
suffix in INSTRUCT is guaranteed, the direct search performs
better than the index search, as it bypasses the overhead
of traversing through the entire length of the query suffix,
as indicated by Figure 11(a). However, for unsuccessful
searches, as the length of the query suffix increases, the
number of triplets increases, producing a better pruning
ratio for the indexed strategy. Thus, it performs better as
shown in Figure 11(b). Figure 12(a) shows the total search
time when both types of searches are issued. Overall, the
index strategy performs better for larger query lengths.

Figure 12: Effect of query suffix length on (a) total search
time and (b) pruning.

The prefix search experiments showed similar behavior
and are, therefore, not reported. The effects of the other
parameters are roughly the same as for an exact key search
(see [14]).

5.7 Effect of query length on substring 
search 

The substring search in INSTRUCT involves a
collection of prefix search queries in the additional
INSTRUCT structures. Hence, the search strategies show
behavior similar to that of prefix search. However, as the
prefix searches are done in a number of structures, for a
successful substring search, the direct search performs
much better (Figure 13(a)), while for an unsuccessful
substring search, the index search shows significant
improvement (Figure 13(b)) due to the better pruning
of the containers that are searched for larger query
substring lengths (as shown in Figure 14(b)). The total
search time for both successful and unsuccessful queries
is captured in Figure 14(a).

Figure 13: Effect of query length on (a) successful and (b)
unsuccessful substring search time.

Figure 14: Effect of substring length on (a) total search
time and (b) pruning.

5.8 Summary of experiments 

We can summarize the experimental observations as fol- 
lows: 

• For an expanding database, the containers
of INSTRUCT should be implemented as a list,
which allows constant insertion time. For a relatively
stable dataset, however, the BST implementation of
the containers is preferred for efficient retrieval.

• For large databases ($10^6$ keys or more), the direct
search performs better as it does not traverse through
the index structure and the pruning ratios of the two
strategies are almost equal.

• When the search query length increases to more than
9, it is better to use the index search strategy as the
pruning offered is better.

• When the alphabet size is more than 15, INSTRUCT
is a better choice than other structures due to lower
memory needs.

6 Conclusions 

In this paper, we have designed a data structure, IN- 
STRUCT, that efficiently manages large sets of strings (or 
keys) and handles all the different string queries with low 
memory requirements. We described the indexing tech- 
nique used by INSTRUCT, and developed two variants — 
list and binary search tree — for the final container of the 
keys. We also developed algorithms for different key op- 
erations including exact key searching, insertion, dele- 
tion, updating, re-insertion, prefix/suffix searching and 
substring searching. We analyzed how the performance 
of the different searching operations and the probability 
of a search being pruned change with the number of keys, 
the length of the key and the alphabet size. Our
experiments showed that INSTRUCT is better than the competing
structures in terms of memory size by up to a factor
of two, while the insertion and searching times are either
better than or comparable with theirs.

In future, we plan to investigate the effect of modeling
the containers as different data structures such as a hash
table, and also how parallelization of the different
procedures improves the running time.